FASTSUBS: An Efficient and Exact Procedure for 
Finding the Most Likely Lexical Substitutes Based 
on an N-gram Language Model 

Abstract: Lexical substitutes have found use in areas such 
as paraphrasing, text simplification, machine translation, word 
sense disambiguation, and part of speech induction. However, 
the computational complexity of accurately identifying the most 
likely substitutes for a word has made large scale experiments 
difficult. In this paper I introduce a new search algorithm, 
FASTSUBS, that is guaranteed to find the K most likely lexical 
substitutes for a given word in a sentence based on an n-gram 
language model. The computation is sub-linear in both K and 
the vocabulary size V. An implementation of the algorithm and 
a dataset with the top 100 substitutes of each token in the WSJ 
section of the Penn Treebank are available at http://goo.gl/jzKH0.



EDICS Category: SPE-LANG 

I. Introduction 

Lexical substitutes have proven useful in applications such 
as paraphrasing [1], text simplification [2], and machine 
translation [3]. The best published results in unsupervised word 
sense disambiguation [4] and part of speech induction [5] represent 
word context as a vector of substitute probabilities. Using a 
statistical language model to find the most likely substitutes of 
a word in a given context is a successful approach ([6], [7]). 
However, the computational cost of an exhaustive algorithm, 
which computes the probability of every word before deciding 
the top K, makes large scale experiments difficult. On the 
other hand, heuristic methods run the risk of missing important 
substitutes. 

This paper presents the FASTSUBS algorithm which can 
efficiently and correctly identify the most likely lexical substi- 
tutes for a given context based on an n-gram language model 
without going through most of the vocabulary. Even though 
the worst-case performance of FASTSUBS is still proportional 
to vocabulary size, experiments demonstrate that the average 
cost is sub-linear in both the number of substitutes K and the 
vocabulary size V. To my knowledge, this is the first sub- 
linear algorithm that exactly identifies the top K most likely 
lexical substitutes. 

The efficiency of FASTSUBS makes large scale experiments 
based on lexical substitutes feasible. For example, it is possible 
to compute the top 100 substitutes for each one of the 
1,173,766 tokens in the WSJ section of the Penn Treebank 
[8] in under 5 hours on a typical workstation. The same 
task would take about 6 days with the exhaustive algorithm. 
The Penn Treebank substitute data and an implementation 
of the algorithm are available from the author's website at 
http://goo.gl/jzKH0. 

Section II derives substitute probabilities as defined by an 
n-gram language model with an arbitrary order and smoothing. 
Section III describes the FASTSUBS algorithm. Section IV 
proves the correctness of the algorithm and Section V presents 
experimental results on its time complexity. Section VI 
summarizes the contributions of this paper. 

II. Substitute Probabilities 

This section presents the derivation of lexical substitute 
probabilities based on an n-gram language model. Details of 
this derivation are important in finding an admissible algorithm 
that identifies the most likely substitutes efficiently, without 
trying out most of the vocabulary. 

N-gram language models assign probabilities to arbitrary 
sequences of words (or other tokens such as punctuation) 
based on their occurrence statistics in large training corpora. 
They approximate the probability of a sequence of words by 
assuming each word is conditionally independent of the rest 
given the previous (n-1) words. For example, a trigram model 
would approximate the probability of a sequence abcde as: 

$$p(abcde) = p(a)\,p(b|a)\,p(c|ab)\,p(d|bc)\,p(e|cd) \tag{1}$$

where lowercase letters like a, b, c represent words and 
strings of letters like abcde represent word sequences. The 
computation is typically performed using log probabilities, 
which turns the product into a summation: 

$$\ell(abcde) = \ell(a) + \ell(b|a) + \ell(c|ab) + \ell(d|bc) + \ell(e|cd) \tag{2}$$

where $\ell(x) = \log p(x)$. The individual conditional probability 
terms are typically expressed in back-off form:¹ 



$$\ell(c|ab) = \begin{cases}
\alpha(abc) & \text{if } f(abc) > 0 \\
\beta(ab) + \ell(c|b) & \text{otherwise}
\end{cases} \tag{3}$$

where $\alpha(abc)$ is the discounted log probability estimate for 
$\ell(c|ab)$ (typically slightly less than the log frequency in the 
training corpus), $f(abc)$ is the number of times abc has been 
observed in the training corpus, and $\beta(ab)$ is the back-off 
weight that makes the probabilities add up to 1. The formula can be 
generalized to arbitrary n-gram orders if we let b stand for 
zero or more words. The recursion bottoms out at unigrams 
(single words) where $\ell(c) = \alpha(c)$. If there are any 
out-of-vocabulary words we assume they are mapped to a special 
⟨unk⟩ token, so $\alpha(c)$ is never undefined. 

¹ Even interpolated models can be represented in the back-off form, and in 
fact that is the way SRILM stores them in ARPA (Doug Paul) format model 
files. 
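The back-off recursion of Equation 3 can be sketched in a few lines of 
Python. This is only an illustration: the dictionaries `alpha` and `beta`, 
the function name `logp`, and all numeric values are hypothetical and do 
not come from a trained model. N-grams are represented as tuples of words, 
an n-gram counts as observed when it has an entry in `alpha`, and the 
back-off weight of an unseen context is taken to be 0. 

```python
import math

# Hypothetical toy model: discounted log probabilities (alpha) for observed
# n-grams and back-off weights (beta) for observed contexts.
alpha = {
    ("a",): -1.0, ("b",): -1.5, ("c",): -2.0,
    ("b", "c"): -0.7,
    ("a", "b", "c"): -0.3,
}
beta = {("a", "b"): -0.2, ("b",): -0.4, ("a",): -0.1}


def logp(word, context):
    """Back-off log probability l(word | context); context is a tuple of words."""
    ngram = context + (word,)
    if ngram in alpha:
        # The n-gram was observed: use its discounted estimate directly.
        return alpha[ngram]
    if not context:
        # Unigram level: unknown words fall back to the <unk> entry if present.
        return alpha.get(("<unk>",), -math.inf)
    # Unobserved n-gram: add the back-off weight of the context (0 if the
    # context itself is unseen) and retry with the leftmost word dropped.
    return beta.get(context, 0.0) + logp(word, context[1:])


print(logp("c", ("a", "b")))  # observed trigram
print(logp("a", ("a", "b")))  # backs off twice, down to the unigram
```

The second call backs off from the trigram to the bigram and then to the 
unigram, accumulating one back-off weight per step. 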

It is best to use both left and right context when estimating 
the probabilities for potential lexical substitutes. For example, 
in "He lived in San Francisco suburbs.", the token San would 
be difficult to guess from the left context but it is almost certain 
looking at the right context. The log probability of a substitute 
word given both left and right contexts can be estimated as: 



$$\begin{aligned}
\ell(x|ab\_de) &\propto \ell(abxde) \\
&\propto \ell(x|ab) + \ell(d|bx) + \ell(e|xd)
\end{aligned} \tag{4}$$

Here the "_" symbol represents the position the candidate 
substitute x is going to occupy. The first line follows from 
the definition of conditional probability and the second line 
comes from Equation 2, except that the terms that do not include 
the candidate x have been dropped. 

The expression for the unnormalized log probability of a 
lexical substitute according to Equation 4 and the decomposition 
of its terms according to Equation 3 can be combined 
to give us Equation 5. For arbitrary order n-gram models we 
would end up with a sum of n terms and each term would 
come from one of n alternatives. 



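$$\ell(x|ab\_de) \propto
\begin{cases}
\alpha(abx) & \text{if } f(abx) > 0 \\
\beta(ab) + \alpha(bx) & \text{if } f(bx) > 0 \\
\beta(ab) + \beta(b) + \alpha(x) & \text{otherwise}
\end{cases}
+
\begin{cases}
\alpha(bxd) & \text{if } f(bxd) > 0 \\
\beta(bx) + \alpha(xd) & \text{if } f(xd) > 0 \\
\beta(bx) + \beta(x) + \alpha(d) & \text{otherwise}
\end{cases}
+
\begin{cases}
\alpha(xde) & \text{if } f(xde) > 0 \\
\beta(xd) + \alpha(de) & \text{if } f(de) > 0 \\
\beta(xd) + \beta(d) + \alpha(e) & \text{otherwise}
\end{cases}
\tag{5}$$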
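For reference, the exhaustive baseline that scores every vocabulary word 
with the three-term sum above can be sketched as follows for a trigram 
model. The toy model, the vocabulary, and the names `logp`, `score` and 
`exhaustive_top_k` are assumptions made for this illustration only; an 
n-gram counts as observed when it has an entry in `alpha`. 

```python
import heapq

# Hypothetical toy model (values are made up for illustration).
alpha = {
    ("san",): -3.0, ("new",): -2.5, ("the",): -1.0, ("in",): -1.5,
    ("lived",): -4.0, ("francisco",): -4.0, ("suburbs",): -4.5,
    ("in", "the"): -0.5, ("in", "san"): -2.0,
    ("san", "francisco"): -0.1, ("francisco", "suburbs"): -1.0,
}
beta = {("in",): -0.3, ("san",): -0.2, ("francisco",): -0.1}


def logp(word, context):
    """Back-off log probability of word given a tuple of preceding words."""
    ngram = context + (word,)
    if ngram in alpha:
        return alpha[ngram]
    if not context:
        return float("-inf")
    return beta.get(context, 0.0) + logp(word, context[1:])


def score(x, left, right):
    """Unnormalized log probability of x in the context left _ right."""
    a, b = left
    d, e = right
    # One term per n-gram that contains the candidate x.
    return logp(x, (a, b)) + logp(d, (b, x)) + logp(e, (x, d))


def exhaustive_top_k(vocab, left, right, k):
    """Score every word in the vocabulary and keep the k best."""
    return heapq.nlargest(k, vocab, key=lambda x: score(x, left, right))


print(exhaustive_top_k(["san", "new", "the"],
                       ("lived", "in"), ("francisco", "suburbs"), 2))
```

In this toy model "the" is the best candidate given the left context alone, 
but "san" wins once the right context is included, which mirrors the 
San Francisco example above. 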



III. Algorithm 



The task of FASTSUBS is to pick the top K substitutes (x) 
from a vocabulary of size V that maximize Equation 5 for a 
given context ab_de. Equation 5 forms a tree where leaf nodes 
are primitive terms such as $\beta(bx)$, $\alpha(xd)$, and parent nodes 
are compound terms, i.e. sums or conditional expressions. The 
basic strategy is to construct a priority queue of candidate 
substitutes for Equation 5 by composing substitute queues 
for each of its sub-expressions. The structure of these queues 
and how they can be composed is described next, followed 
by the construction of the individual queues for each of the 
subexpressions. 

A. Upper bound queues 

A sum such as $\beta(bx) + \alpha(xd)$ is not necessarily maximized 
by the x's that maximize either of its terms. What we can 
say for sure is that the sum for any x cannot exceed the 
upper bound $\beta(bx_1) + \alpha(x_2 d)$ where $x_1$ maximizes $\beta(bx)$ 
and $x_2$ maximizes $\alpha(xd)$. We can find the x that maximizes 
the sum by repeatedly evaluating candidates until we find one 
whose value is (i) larger than all the candidates that have 
been evaluated, and (ii) larger than the upper bound for the 
remaining candidates. 

Based on this intuition, we define an abstract data type 
called an upper bound queue that maintains an upper bound 
on the actual values of its elements. Each successive pop 
from an upper bound queue is not guaranteed to retrieve the 
element with the largest value, but the remaining elements 
are guaranteed to have values smaller than or equal to a non- 
increasing upper bound. An upper bound queue supports three 
operations: 

• SUP(q): returns an upper bound on the value of the 
elements in the queue. 

• TOP(q): returns the top element in the queue. Note that 
this element is not guaranteed to have the highest value. 

• POP(q): extracts and returns the top element in the queue 
and updates the upper bound if possible. 

Upper bound queues can be composed easily. Going back 
to our sum example, let us assume that we have valid upper 
bound queues $q_\alpha$ for $\alpha(xd)$ and $q_\beta$ for $\beta(bx)$. The 
queue $q_\sigma$ for the sum ($\beta(bx) + \alpha(xd)$) has 
SUP($q_\sigma$) = SUP($q_\alpha$) + SUP($q_\beta$) 
because the upper bound for a sum clearly cannot exceed the 
total of the upper bounds for its constituent terms. TOP($q_\sigma$) 
can return any element from the queue without violating the 
contract. However, in order to find the true maximum, we 
eventually need an element whose value exceeds the upper 
bound for the remaining elements. Thus we can bias our 
choice for TOP($q_\sigma$) to prefer elements that (i) have high values, 
and (ii) reduce the upper bound quickly. In practice 
non-deterministically picking TOP($q_\sigma$) to be one of TOP($q_\alpha$) or 
TOP($q_\beta$) works well. POP($q_\sigma$) can extract and return the same 
element from the corresponding child queue. If the upper 
bound of a child queue drops as a result, so does the upper 
bound of the compound queue $q_\sigma$. 
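The composition of two upper bound queues into a queue for their sum can be 
sketched as follows. The class names `SortedQueue` and `SumQueue`, the 
function `argmax_sum`, and the toy values are assumptions for illustration. 
For simplicity both terms are given as dictionaries over the same words and 
each child is an exact sorted queue, which is a simplification of the 
queues described in the following sections. 

```python
import random


class SortedQueue:
    """Exact priority queue over (word, value) pairs, largest value first."""

    def __init__(self, items):
        self.items = sorted(items, key=lambda wv: wv[1], reverse=True)
        self.pos = 0

    def sup(self):
        # The next value bounds everything that remains in the queue.
        if self.pos < len(self.items):
            return self.items[self.pos][1]
        return float("-inf")

    def pop(self):
        if self.pos == len(self.items):
            return None
        word = self.items[self.pos][0]
        self.pos += 1
        return word


class SumQueue:
    """Upper bound queue for a sum of terms, one child queue per term."""

    def __init__(self, children, rng):
        self.children = children
        self.rng = rng

    def sup(self):
        # A sum cannot exceed the total of its children's upper bounds.
        return sum(child.sup() for child in self.children)

    def pop(self):
        # Pop from a randomly chosen child that still has elements.
        live = [c for c in self.children if c.sup() > float("-inf")]
        return self.rng.choice(live).pop() if live else None


def argmax_sum(f, g, rng):
    """Find the word maximizing f[x] + g[x] without scanning every word."""
    queue = SumQueue([SortedQueue(f.items()), SortedQueue(g.items())], rng)
    best, best_value = None, float("-inf")
    while best_value < queue.sup():
        x = queue.pop()
        value = f[x] + g[x]  # true value of the candidate
        if value > best_value:
            best, best_value = x, value
    return best


f = {"w1": -1.0, "w2": -0.2, "w3": -3.0}
g = {"w1": -0.5, "w2": -2.5, "w3": -0.1}
print(argmax_sum(f, g, random.Random(0)))  # prints w1
```

Here w2 maximizes the first term and w3 the second, yet w1 maximizes the 
sum. The loop stops as soon as the best value seen is at least the upper 
bound on the words that have not been popped, so the answer does not depend 
on the random choices. 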

B. Top level queue 

The top level sum in Equation 5 is a sum of N conditional 
expressions for an order N language model. We can construct 
an upper bound queue for the sum using the upper bound 
queues for its constituent terms as described in the previous 
section. Let q represent the queue for the top level sum, 
$s \in C$ represent the constituent conditional expressions and 
$q_s$ represent their associated queues. 



$$\begin{aligned}
\text{SUP}(q) &= \sum_{s \in C} \text{SUP}(q_s) \\
\text{TOP}(q) &= \text{TOP}(q_s) \text{ for a random } s
\end{aligned} \tag{6}$$



For TOP(q) we non-deterministically pick the top element from 
one of the children, and POP(q) extracts and returns that same 
element, adjusting the upper bound if necessary. 

As mentioned before, TOP(q) does not necessarily return 
the element with the maximum value. In order to find the top 
K elements, FASTSUBS keeps popping elements from q and 
computes their true values according to Equation 5 until at 
least K of them have values above the upper bound for the 
remaining elements in the queue. Table I gives the 
pseudo-code for FASTSUBS. 



FASTSUBS(S, K) 

1) Initialize upper bound queue q for context S. 

2) Initialize set of candidate words X = {}. 

3) WHILE $|\{x : x \in X, \ell(x|S) > \text{SUP}(q)\}| < K$ 
   DO X := X ∪ {POP(q)} 

4) Return top K words in X based on $\ell(x|S)$. 



TABLE I 

Pseudo-code for FASTSUBS. Given a word context S and the 
desired number of substitutes K, FASTSUBS returns the set of 
top K words that maximize $\ell(x|S)$. 
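The loop in Table I can be sketched in Python as follows. The function 
`fastsubs`, the class `BoundQueue`, and the toy numbers are hypothetical. 
`BoundQueue` stands in for the real top level queue: it orders words by a 
precomputed upper bound on their value, which satisfies the upper bound 
queue contract but is not how the actual queues are built. 

```python
class BoundQueue:
    """Stand-in upper bound queue: words sorted by a bound on their value."""

    def __init__(self, bounds):
        self.items = sorted(bounds.items(), key=lambda wb: wb[1], reverse=True)
        self.pos = 0

    def sup(self):
        # Bound of the next word; all remaining words have smaller bounds.
        if self.pos < len(self.items):
            return self.items[self.pos][1]
        return float("-inf")

    def pop(self):
        if self.pos == len(self.items):
            return None
        word = self.items[self.pos][0]
        self.pos += 1
        return word


def fastsubs(queue, value, k):
    """Pop candidates until k of them score above the bound on the rest."""
    seen = {}
    while sum(v > queue.sup() for v in seen.values()) < k:
        x = queue.pop()
        if x is None:  # queue exhausted: every word has been evaluated
            break
        seen[x] = value(x)  # true value of the candidate
    return sorted(seen, key=seen.get, reverse=True)[:k]


# Hypothetical true values and (looser) upper bounds for five words.
values = {"a": -1.0, "b": -3.0, "c": -1.5, "d": -4.0, "e": -6.0}
bounds = {"a": -0.5, "b": -0.6, "c": -1.2, "d": -3.5, "e": -5.0}

queue = BoundQueue(bounds)
print(fastsubs(queue, values.get, 2), "after", queue.pos, "pops")
```

In this example the top two words are found after three pops, and the 
words d and e are never evaluated. Recounting the qualifying candidates on 
every iteration keeps the sketch short; an implementation concerned with 
speed would track them incrementally. 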



This procedure will return the correct result as long as 
POP(q) cycles through all the words in the vocabulary and the 
upper bound for the remaining elements, SUP(q), is accurate. 
The loop can in fact cycle through all the words in the 
vocabulary because at least one of the subexpressions, $\alpha(x)$, is 
well defined for every word. The accuracy of SUP(q) depends 
on the accuracy of the upper bounds for constituent terms, 
which are described next. 

C. Queues for conditional expressions 

Conditional expressions, indicated by "{" in Equation 5, pick 
their topmost child whose $\alpha$ argument has been observed 
in the training corpus. Let $q_s$ be the queue for such a 
conditional expression and $\sigma \in C_s$ be its children terms. Let 
$\sigma_{\max} = \arg\max_{\sigma \in C_s} \text{SUP}(q_\sigma)$ be the child 
whose queue has the maximum upper bound. The upper bound for $q_s$ cannot 
exceed the upper bound for $q_{\sigma_{\max}}$ because the value of the 
conditional expression for any given x is equal to the value of 
one of its children. Thus we define the queue operations for 
conditional expressions based on $q_{\sigma_{\max}}$: 

$$\begin{aligned}
\text{SUP}(q_s) &= \text{SUP}(q_{\sigma_{\max}}) \\
\text{TOP}(q_s) &= \text{TOP}(q_{\sigma_{\max}})
\end{aligned} \tag{7}$$
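A minimal sketch of these two operations is given below. The class names 
and the toy values are hypothetical, and the children are plain sorted 
queues rather than the sums of primitive terms described in the next 
section; only the delegation to the child with the largest upper bound is 
the point of the example. 

```python
class SortedQueue:
    """Exact priority queue over (word, value) pairs, largest value first."""

    def __init__(self, items):
        self.items = sorted(items, key=lambda wv: wv[1], reverse=True)
        self.pos = 0

    def sup(self):
        if self.pos < len(self.items):
            return self.items[self.pos][1]
        return float("-inf")

    def top(self):
        return self.items[self.pos][0] if self.pos < len(self.items) else None

    def pop(self):
        word = self.top()
        if word is not None:
            self.pos += 1
        return word


class ConditionalQueue:
    """Queue for a conditional expression: defer to the child whose
    upper bound is currently the largest."""

    def __init__(self, children):
        self.children = children

    def _max_child(self):
        return max(self.children, key=lambda child: child.sup())

    def sup(self):
        return self._max_child().sup()

    def top(self):
        return self._max_child().top()

    def pop(self):
        return self._max_child().pop()


# Hypothetical children, e.g. a trigram-level and a bigram-level alternative.
first = SortedQueue([("w1", -0.3), ("w2", -2.0)])
second = SortedQueue([("w3", -0.9)])
queue = ConditionalQueue([first, second])

order = []
while queue.top() is not None:
    word = queue.pop()
    order.append((word, queue.sup()))  # bound on what remains after the pop
print(order)
```

Each pop comes from whichever child currently has the larger bound, so the 
bound of the conditional queue never increases as elements are removed. 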

D. Queues for sums of primitive terms 

As described in Section III-A, the upper bound of a queue 
for a sum like $\beta(bx) + \alpha(xd)$ is equal to the sum of the upper 
bounds of the constituent queues. It turns out that for sums of 
primitive terms, only the $\alpha$ term that has the candidate word x 
as an argument has a non-constant upper bound. The language 
model defines $\beta$ to be 0 for any word sequence that does not 
appear in the training set. Therefore the $\beta$ terms that have the 
candidate word x as an argument always have the upper bound 
0. Finally, the $\alpha$ and $\beta$ terms without the candidate word x 
act as constants. 

For notational consistency we define upper bounds for the 
constant terms as well. Let A and B represent sequences of 
zero or more words that do not include the candidate x. We 
have: 

$$\begin{aligned}
\text{SUP}(q_\alpha(A)) &= \alpha(A) \\
\text{SUP}(q_\beta(B)) &= \beta(B)
\end{aligned} \tag{8}$$



For $\beta$ terms with x in their argument, many words from the 
vocabulary would be unobserved in the argument sequence 
and share the maximum $\beta$ value of 0. In the rare case that 
all vocabulary words have been observed in the argument 
sequence, they would each have negative $\beta$ values and 0 would 
still be a valid upper bound. Thus FASTSUBS uses the constant 
0 as an upper bound for $\beta$ terms with x. 

$$\text{SUP}(q_\beta(AxB)) = 0 \tag{9}$$

Only the $\alpha$ term with an x argument has an upper bound 
queue as described in the next section. FASTSUBS picks the 
top element for a sum of primitive terms only from its $\alpha$ 
constituent.² Let $q_\sigma$ be the queue for a sum of primitive terms 
and let $\gamma \in C_\sigma$ indicate its constituents ($\alpha$, $\beta$, 
constant or otherwise). We have: 

$$\begin{aligned}
\text{SUP}(q_\sigma) &= \sum_{\gamma \in C_\sigma} \text{SUP}(q_\gamma) \\
\text{TOP}(q_\sigma) &= \begin{cases}
\text{TOP}(q_\alpha) & \text{if the } \alpha \text{ term has an } x \text{ argument} \\
\text{UNDEF} & \text{otherwise}
\end{cases}
\end{aligned} \tag{10}$$

E. Queues for primitive terms 

FASTSUBS pre-computes actual priority queues (which satisfy 
the upper bound queue contract) for $\alpha$ terms that include 
x in their argument: 

$$\begin{aligned}
\text{SUP}(q_\alpha(AxB)) &= \max_x \alpha(AxB) \\
\text{TOP}(q_\alpha(AxB)) &= \arg\max_x \alpha(AxB)
\end{aligned} \tag{11}$$

Here A and B stand for zero or more words and x is 
a candidate lexical substitute word. SUP($q_\alpha$) gives the real 
maximum, and thus provides a tight upper bound. TOP($q_\alpha$) is 
guaranteed to return the element with the highest value. 

The $q_\alpha$ queues are constructed once at the beginning of the 
program as sorted arrays and re-used in queries for different 
contexts. The construction can be performed in one pass 
through the language model and the memory requirement is of 
the same order as the size of the language model. Candidates 
that have not been observed in the argument context will 
be at the bottom of this queue because $\alpha(AxB) = -\infty$ if 
$f(AxB) = 0$. To save memory such x are not placed in the 
queue. Thus after we run out of elements in $q_\alpha$ the queue 
returns: 

$$\begin{aligned}
\text{SUP}(q_\alpha(AxB)) &= -\infty \\
\text{TOP}(q_\alpha(AxB)) &= \text{UNDEF}
\end{aligned} \tag{12}$$
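One way to build such sorted arrays in a single pass over the model is 
sketched below. The function name `build_alpha_queues`, the keying of each 
queue by the pair (A, B), and the toy values are assumptions made for this 
illustration; the paper does not specify the data layout. 

```python
from collections import defaultdict


def build_alpha_queues(alpha):
    """Map each pattern (A, B) to the list of (x, alpha(AxB)) pairs for the
    observed n-grams AxB, sorted by decreasing value."""
    queues = defaultdict(list)
    for ngram, value in alpha.items():
        # Every position of the n-gram can play the role of the candidate x.
        for i, x in enumerate(ngram):
            queues[(ngram[:i], ngram[i + 1:])].append((x, value))
    for entries in queues.values():
        entries.sort(key=lambda xv: xv[1], reverse=True)
    return dict(queues)


# Hypothetical discounted log probabilities for a handful of n-grams.
alpha = {
    ("the",): -1.0, ("san",): -3.0,
    ("in", "san"): -2.0, ("in", "the"): -0.5,
    ("san", "francisco"): -0.1,
}
queues = build_alpha_queues(alpha)

# Candidates x for the pattern "in x", best first.
print(queues[(("in",), ())])
# Candidates x for the pattern "x francisco".
print(queues[((), ("francisco",))])
```

An n-gram of length n contributes n entries, so the total size stays within 
a constant factor of the model size for a fixed order. Unobserved 
candidates never appear in a list, which corresponds to the exhausted-queue 
case of Equation 12. 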

IV. Correctness 

As mentioned in Section III-B, the correctness of the algorithm 
depends on two factors: (i) the SUP(q) function should return 
an upper bound on the remaining values in q, and (ii) the 
POP(q) function should cycle through the whole vocabulary 
for the top level queue. 

The correctness of the SUP(q) function can be proved 
recursively. For primitive terms SUP(q) is equal to the actual 
maximum (e.g. for $q_\alpha$), or is an obvious upper bound (e.g. 
SUP($q_\beta(AxB)$) = 0). For sums, SUP(q) is equal to the sum 
of the upper bounds for the children and, for conditional 
expressions, SUP(q) is equal to the maximum of the upper 
bounds for the children. 

² Remember that the top element in an upper bound queue is not guaranteed 
to have the largest value. Thus ignoring the $\beta$ terms does not affect the 
correctness of the algorithm. 

To prove that POP(q) will cycle through the entire vocabulary 
it suffices to show that the queue for at least one child of 
q will cycle through the entire vocabulary. This is in fact the 
case because one of the children will always include the term 
$\alpha(x)$ whose queue contains the entire vocabulary. 

V. Complexity 

An exhaustive algorithm to find the most likely substitutes 
in a given context could try each word in the vocabulary as a 
potential substitute x and compute the value of the expression 
given in Equation 5. The computation of Equation 5 requires 
$O(N^2)$ operations for an order N language model, which we 
will assume to be a constant. If we have V words in our 
vocabulary, the cost of the exhaustive algorithm to find a single 
most likely substitute would be $O(V)$. 

In order to quantify the efficiency of FASTSUBS on a real 
world dataset, I used a corpus of 126 million words of WSJ 
data as the training set and the WSJ section of the Penn 
Treebank [8] as the test set. Several 4-gram language models 
were built from the training set using Kneser-Ney smoothing 
in SRILM [9] with vocabulary sizes ranging from 16K to 
512K words. The average number of POP(q) operations for 
the top level upper bound queue was measured for number 
of substitutes K ranging from 1 to 16K. Figure 1 shows the 
results. 

Fig. 1. Number of iterations as a function of the number of substitutes K 
and the vocabulary size V. The solid curves represent results with vocabulary 
sizes from 16K to 512K. The horizontal dotted line gives the cost of the 
exhaustive algorithm for V = 64K. The diagonal dotted line is a functional 
approximation in the form $K^{\lambda} V^{(1-\lambda)}$ for V = 64K and $\lambda$ = 0.5878. 

The time cost of FASTSUBS depends on the number of 
iterations of the while loop in Table I, which in turn depends 
on the quality of words returned by POP(q) and the tightness 
of the upper bound given by SUP(q). The worst case is no 
better than the exhaustive algorithm's $O(V)$. However, Figure 1 
shows that the average performance of FASTSUBS on real data 
is significantly better when $K \ll V$. The number of POP(q) 
operations in the while loop to get the top K substitutes is 
sub-linear in K (the slope of the log-log curves is around 
0.5878) and approaches the vocabulary size V as $K \to V$. The 
effect of vocabulary size is small: increasing 
vocabulary size from 16K to 512K less than doubles the 
average number of steps for a given K. 
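To get a feel for the functional approximation quoted in the caption of 
Figure 1, the short sketch below evaluates it for a few values of K. The 
function name `approx_pops` is hypothetical, "64K" is read here as 65,536 
words, and the fit is reported only for that vocabulary size, so the 
numbers are rough indications rather than measurements. 

```python
def approx_pops(k, v, exponent=0.5878):
    """Fitted form K^e * V^(1 - e) from the Figure 1 caption."""
    return k ** exponent * v ** (1 - exponent)


V = 64 * 1024  # assumption: "64K" means 65,536 words

for k in (1, 100, 10_000, V):
    # Columns: K, approximate number of pops, cost of the exhaustive scan.
    print(k, round(approx_pops(k, V)), V)
```

By construction the approximation equals V when K = V, which matches the 
observation that the number of pops approaches the vocabulary size as K 
grows. 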

As a practical example, it is possible to compute the top 
100 substitutes for each one of the 1,173,766 tokens in Penn 
Treebank with a vocabulary size of 64K in under 5 hours on 
a typical 2012 workstation.³ The same task would take about 
6 days for the exhaustive algorithm. 

VI. Contributions 

Finding likely lexical substitutes has a range of applications 
in natural language processing. In this paper we introduced an 
exact and efficient algorithm, FASTSUBS, that is guaranteed 
to find the K most likely substitutes for a given word context 
from a V word vocabulary. Its average runtime is sub-linear 
in both V and K, giving a significant improvement over an 
exhaustive $O(V)$ algorithm when $K \ll V$. An implementation 
of the algorithm and a dataset with the top 100 substitutes 
of each token in the WSJ section of the Penn Treebank are 
available at http://goo.gl/jzKH0. 

³ Running a single thread on an Intel Xeon E7-4850 2GHz processor. 



